Correlation and Regression Models

Part 3: ANOVA and General Linear Model

Esteban Montenegro-Montenegro, PhD

Psychology and Child Development

Why are we studying the regression model and other demons?

  • Regression is only one member of the General Linear Model (GLM) family.

  • The GLM covers models such as the Analysis of Variance (ANOVA), the Multivariate Analysis of Variance (MANOVA), the Classical Regression Model (linear regression), and the \(t\)-test.

  • My aim is to introduce these topics as members of one family, the GLM, rather than as isolated techniques.

  • There is a large body of theory behind these models, but we don’t have time to discuss every detail; more advanced classes are needed to cover the particulars.

  graph TD
    A[GLM] --> B(ANOVA)
    A[GLM] --> C(MANOVA)
    A[GLM] --> D(t-test)
    A[GLM] --> E(Linear Regression)

Let’s make up time for ANOVA

  • Analysis of Variance, a.k.a. ANOVA, is a model that helps us analyze mean differences across multiple groups. It is actually very similar to the Classical Regression Model, which is why I included ANOVA in the Classical Regression Model lecture.

  • The ANOVA model tests the null hypothesis that all group means are equal. It can be represented like this:

\[\begin{equation} H_{0}: \mu_{1} = \mu_{2} = \mu_{3} \end{equation}\]

  • This would be what we call an omnibus test, which means we are testing the overall effect of our independent variable on the dependent variable.

  • As always, let’s understand this with an example:

Wage's mean, SD, and variance by marital status

| maritl | M | SD | Variance |
|---|---:|---:|---:|
| 1. Never Married | 92.73 | 32.92 | 1083.73 |
| 2. Married | 118.86 | 43.12 | 1859.38 |
| 3. Widowed | 99.54 | 23.74 | 563.64 |
| 4. Divorced | 103.16 | 33.80 | 1142.51 |
| 5. Separated | 101.22 | 33.66 | 1133.22 |
  • In the table above we can see the mean of wage by marital status: the males in the sample are never married, married, widowed, divorced, or separated. ANOVA will help us answer the question:

Are the mean differences in wage by marital status explainable by chance alone?

  • In this case ANOVA can tell us: Yes! There is a difference. But where?

  • In this scenario, ANOVA allows us to do pairwise comparisons; this means we could compare the mean of divorced males versus the mean of married males, and later test all the other combinations as well. This is called a post hoc analysis.

Let’s make up time for ANOVA (cont.)

  • The ANOVA follows the logic of variance decomposition:

\[\begin{equation} outcome_{i} = (model) + error_{i} \end{equation}\]

  • This means that ANOVA accounts for the variance within your groups, and also calculates the variance that is explained by your MODEL. In fact, the Classical Regression Model does exactly the same thing.
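To see that ANOVA and regression really are the same machinery, here is a quick sketch: fitting the same formula with aov() and with lm() yields the identical F test. This assumes the Wage data come from the ISLR package, as in the rest of the lecture.

```r
# ANOVA and regression are the same model under the hood.
# Assumption: Wage is the dataset from the ISLR package.
library(ISLR)

fitAnova <- aov(wage ~ maritl, data = Wage)  # ANOVA parameterization
fitLm    <- lm(wage ~ maritl, data = Wage)   # regression with dummy-coded groups

summary(fitAnova)  # F test for maritl
anova(fitLm)       # same F statistic and p-value as the aov() summary
```

Both calls decompose the same variance; they only report it with different default parameterizations.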

  • Remember: the line of best fit in regression is estimated by minimizing the sum of squared distances from the line. The line with the lowest sum of squared distances is the best line in regression.

  • We are going to do something similar in ANOVA. It is called the Total Sum of Squares (\(SS_{T}\)), and it measures the total squared distance of the observations from the grand mean:

\[\begin{equation} SS_{T} = \sum^N_{i=1}(x_{i}-\bar{x}_{(grandMean)})^2 \end{equation}\]

  • The term \(x_{i}\) is the score or value for each observation, and \(\bar{x}_{(grandMean)}\) is the overall mean.

Let’s make up time for ANOVA (cont.)

  • We can continue with our example with the model wage ~ maritl.

  • We can calculate the grand mean of wage:

WageGrandMean <- mean(Wage$wage)
WageGrandMean 
[1] 111.7036
  • Now we can estimate \(SS_{T}\) in R:
sstData <- Wage |> select(wage, maritl) |> 
  mutate(SSt = (wage - mean(wage))^2)
  
sum(sstData$SSt)
[1] 5222086

The total \(SS_{T}\) is 5222086. This is the TOTAL variation in the data.

  • We also need to estimate the Model Sum of Squares:

\[\begin{equation} SS_{M} = \sum^K_{k=1}n_{k}(\bar{x}_{k}-\bar{x}_{grandMean})^2 \end{equation}\]

  • The index \(k\) represents each group and \(n_{k}\) is that group’s size. In simple words, we are estimating the squared difference of each group mean from the overall mean, weighted by the number of observations in the group. That’s each group’s distance from the grand mean.
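As a sketch, \(SS_{M}\) can also be computed by hand with dplyr, assuming the same Wage data used in the earlier chunks:

```r
# Sketch: model (between-group) sum of squares by hand.
# Assumption: Wage is loaded (e.g., from the ISLR package) and dplyr is attached.
library(dplyr)

grandMean <- mean(Wage$wage)

ssmData <- Wage |>
  group_by(maritl) |>
  summarise(n_k = n(), groupMean = mean(wage)) |>
  mutate(SSm = n_k * (groupMean - grandMean)^2)

sum(ssmData$SSm)  # explained variation; matches the maritl Sum Sq reported by aov()
```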

  • We will also need to estimate the residuals of our model; the residuals are the variance not explained by our ANOVA model. The estimation is:

\[\begin{equation} SS_{R} = \sum^N_{i=1}(x_{ik}-\bar{x}_{k})^2 \end{equation}\]

  • In this estimation we compute the difference of each observation from the mean of the group it belongs to. So, for instance, if Katie were in the divorced group, we would compute: \(wage_{katie} - mean(wage_{divorced})\). This is an indicator of variation not explained by the model. In other words, ANOVA does not explain what happens within each group; it accounts for the between-group variation.
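Following the same logic, here is a sketch of the residual sum of squares in R, under the same assumption that the Wage data and dplyr are loaded:

```r
# Sketch: residual (within-group) sum of squares.
# Assumption: Wage is loaded and dplyr is attached, as in the earlier chunks.
library(dplyr)

ssrData <- Wage |>
  group_by(maritl) |>
  mutate(SSr = (wage - mean(wage))^2) |>  # squared deviation from own group mean
  ungroup()

sum(ssrData$SSr)  # unexplained variation; note that SS_T = SS_M + SS_R
```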

Time for examples!

modelAnova <- aov(wage ~ maritl, data = Wage)
summary(modelAnova)
              Df  Sum Sq Mean Sq F value Pr(>F)    
maritl         4  363144   90786   55.96 <2e-16 ***
Residuals   2995 4858941    1622                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The result shows that the differences are not explained by chance alone. But wait, this is an omnibus test: it only tells us that at least one mean is different from the others.
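The F value in the table is just the ratio of the two mean squares. As a sketch, using the sums of squares and degrees of freedom read off the aov() output above:

```r
# F = (SS_M / df_M) / (SS_R / df_R), values taken from the aov() summary
SSm <- 363144;  dfM <- 4     # model: k - 1 = 5 - 1 groups
SSr <- 4858941; dfR <- 2995  # residuals: N - k = 3000 - 5

MSm <- SSm / dfM  # mean square for the model (90786)
MSr <- SSr / dfR  # mean square for the residuals (about 1622)

MSm / MSr         # about 55.96, the F value reported above
```

A large F means the between-group variation dwarfs the within-group variation.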

Time for examples!

ggplot(data = Wage, aes(x = maritl, y = wage, fill = maritl)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", color = "red") +
  scale_fill_viridis(discrete = TRUE, alpha = 0.6, option = "A") +
  theme_classic() +
  theme(legend.position = "none") +
  labs(x = "Marital Status", y = "Wage",
       title = "Boxplot of Wage by Marital Status")

Time for examples!

  • Let’s see the pairwise comparisons:
pairTest <- TukeyHSD(modelAnova)

as.data.frame(lapply(pairTest, function(x) round(x, 2))) %>%
  rename(meanDiff  = maritl.diff,
         lower     = maritl.lwr,
         upper     = maritl.upr,
         pvalueAdj = maritl.p.adj) %>%
  kbl(caption = "Pairwise Contrast") %>%
  kable_classic_2("hover", full_width = TRUE,
                  bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Pairwise Contrast

| Contrast | meanDiff | lower | upper | pvalueAdj |
|---|---:|---:|---:|---:|
| 2. Married-1. Never Married | 26.13 | 21.18 | 31.07 | 0.00 |
| 3. Widowed-1. Never Married | 6.80 | -18.78 | 32.39 | 0.95 |
| 4. Divorced-1. Never Married | 10.42 | 1.60 | 19.25 | 0.01 |
| 5. Separated-1. Never Married | 8.48 | -6.96 | 23.92 | 0.56 |
| 3. Widowed-2. Married | -19.32 | -44.66 | 6.02 | 0.23 |
| 4. Divorced-2. Married | -15.70 | -23.77 | -7.63 | 0.00 |
| 5. Separated-2. Married | -17.64 | -32.66 | -2.63 | 0.01 |
| 4. Divorced-3. Widowed | 3.62 | -22.75 | 29.99 | 1.00 |
| 5. Separated-3. Widowed | 1.68 | -27.58 | 30.93 | 1.00 |
| 5. Separated-4. Divorced | -1.94 | -18.65 | 14.76 | 1.00 |